Wavelet Document Image Compression

Field Of the Invention 

The present invention relates to a novel image compression technique for classifying,
matching and identifying document images based on a wavelet compression method.
This technique is called the wavelet document image compression (WDIC) technique.
More specifically, the WDIC technique separates the characters/lines and pictures
from the background of an original document image and uses a different technique
to compress each of those components. More generally, the technique may also be
applied to other special documents, such as particularly important historical
documents, scientific papers with mathematical or chemical formulae, software
documents and some handwritten signatures.

Background Of the Invention 

As electronic storage, retrieval and distribution of documents become faster and
cheaper, more and more documents are being digitized. In the last decade,
existing documents have usually been retyped and converted to HTML or Adobe's PDF
format, sometimes with the aid of Optical Character Recognition (OCR) techniques.
Unfortunately, these techniques are still far from being able to translate a
scanned document faithfully into a web page, and much of the visual aspect of the
original document is likely to be lost. Recently, several authors [1-4] have proposed
image-based approaches to digital documents. The "image-based approach" to digital
documents is to store and transmit documents as images. Traditional image compression
standards such as JPEG and GIF are inappropriate for document images. Although they are
suitable for continuous-tone images (i.e. for most pictures of natural scenes), they are
not suitable for the sharp edges of character images. On the other hand, a scanned
document tends to be quite large if one wants to preserve the readability of the text.
An approach to compressing document images is therefore needed that makes it possible
to transfer a high-quality page of a document image at a very high compression ratio.
The WDIC document image compression technique described here is designed to
overcome all the above problems.

Objective Of the Invention 

The object of the invention is to provide a novel image compression technique
(WDIC) for classifying, matching and identifying document images. A more specific
object is to provide a wavelet-based compression algorithm for picture images, and a
novel extent-based morphological matching, clustering and wavelet compression
algorithm for mostly small character/line images.

Summary Of the Invention 

The invention comprises a number of novel algorithms for an improved document
image compression technique. The main idea of our document image compression
technique is to extract two main categories of areas, picture areas and character/line
areas, from the document image and to encode the residue image obtained by subtracting
these two categories of areas from the document image.



The character image can be encoded with a novel extent-based morphological
matching, clustering and wavelet compression algorithm. A picture image can be
encoded with a wavelet-based compression algorithm, which is suitable for grey scale
images. The background image can also be encoded with a wavelet-based SAQ encoder.



WDIC is a progressive code. It provides progressive decoding not only of the
background image, but also of the character images.



Detailed Description Of the Invention 

In the following sections the novel techniques of the WDIC are described. The
features of WDIC comprise special image segmentation for a document image; a fast
classification, morphological matching and clustering algorithm for character/line
images; and a wavelet-based compression algorithm for picture images. Results from an
actual WDIC system showed that the novel means contribute significant
performance improvements in two aspects: a highly efficient compression format and ...



Encoder 10 

Figure 1 is the block diagram of the encoder 10. We start our process from a scanned 
grey scale document image 101 with scanned resolution r dpi. Process 102 extracts 
picture image blocks 103 from 101. 103 will be encoded by wavelet based SAQ 
encoder 104 ([5],[6]). Encoder 104 passes the compressed bit stream 105 to the 
process 118. 

By subtracting the picture images 103 from 101, the residue image 106 is obtained and
further processed by 107. Process 107 generates the connected blocks from residue image
106 based on a region growing algorithm.

At first, we posterize the image into 3 levels as below.

$$F(v) = \begin{cases} 0 & \text{when } I(v) \ge P \\ 1 & \text{when } 128 \le I(v) < P \\ 2 & \text{when } I(v) < 128 \end{cases}$$

where $I(v)$ is the intensity of the pixel at $v = (v_x, v_y)$.
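The three-level posterization can be sketched in a few lines. This is a minimal illustration, assuming the image is an 8-bit grey scale NumPy array and that the threshold P is supplied by the caller; the function name `posterize` is hypothetical and not part of the source.

```python
import numpy as np

def posterize(image, p):
    """Map intensities to the three levels F = 0, 1, 2.

    F = 0 for I >= p (background), F = 1 for 128 <= I < p,
    F = 2 for I < 128 (dark foreground).
    """
    img = np.asarray(image)
    f = np.zeros(img.shape, dtype=np.uint8)   # default level 0
    f[(img >= 128) & (img < p)] = 1           # intermediate intensities
    f[img < 128] = 2                          # dark pixels
    return f
```

For example, with P = 191.5 the intensities 255, 200, 150 and 100 map to levels 0, 0, 1 and 2 respectively.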

The following algorithm is performed at every untraced pixel $u$ with
$F(u) = 2$.

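1. $S = \emptyset$, $S_1 = \{u\}$, $W = \ldots \times C$,

where $C$ is slightly larger than the font size of most characters/letters (default $C = 24$).

2. Find $v \in S_1$. $\{v_i\}_{i=1}^{8}$ represent the eight neighbor pixels of $v$ in
clockwise order; among them $\{v_1, v_3, v_5, v_7\}$ are the 4-neighbor pixels
of $v$. Define $v_{i+8k} = v_i$, $k \in \mathbb{Z}$. Set $S = S \cup \{v\}$, $S_1 = S_1 \setminus \{v\}$.

3. For $v_i$, $i = 1, \ldots, 8$, with $|v_{i,x} - u_x| \le W$ and $|v_{i,y} - u_y| \le W$, if one of the following conditions holds, then $S_1 = S_1 \cup \{v_i\}$:

a. if $i = 1, 3, 5, 7$:

i. $F(v) = 2$ and ($F(v_i) = 2$ or ($F(v_i) = 1$ and $F(v_{i-1}) + F(v_{i+1}) > 1$))

ii. $F(v) = 1$ and ($F(v_i) = 2$ and $F(v_{i-1}) + F(v_{i+1}) \ge 2$)

b. if $i = 2, 4, 6, 8$:

$F(v) = 2$ and ($F(v_i) = 2$ and $F(v_{i-1}) + F(v_{i+1}) \ge 1$)

4. If $S_1 \ne \emptyset$, go to step 2.

5. $A = \{(x, y) \mid x_{\min} \le x \le x_{\max},\ y_{\min} \le y \le y_{\max}\}$ is the character image block,
where $x_{\min} = \min_{v \in S}\{v_x\}$, $x_{\max} = \max_{v \in S}\{v_x\}$, $y_{\min} = \min_{v \in S}\{v_y\}$, $y_{\max} = \max_{v \in S}\{v_y\}$.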

After the character image block A is extracted, it is saved into the
character block list. We mark the pixels in this block as traced
pixels and change their value to 255. The same procedure then starts again from
an untraced pixel satisfying $F(u) = 2$, until no such pixel exists.
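The seed-and-grow loop can be illustrated with a simplified sketch. It is not the connectivity rule of the text: as an assumption, it grows only through pixels with F = 2 using plain 8-connectivity, keeps the window limit W around the seed, and records tracing in a separate mask instead of overwriting intensities with 255. The function name and the box layout (x_min, y_min, x_max, y_max) are hypothetical.

```python
from collections import deque
import numpy as np

def extract_character_blocks(f, w):
    """Return inclusive bounding boxes (x_min, y_min, x_max, y_max)."""
    f = np.asarray(f)
    height, width = f.shape
    traced = np.zeros(f.shape, dtype=bool)
    blocks = []
    for uy in range(height):
        for ux in range(width):
            if f[uy, ux] != 2 or traced[uy, ux]:
                continue
            pending = deque([(ux, uy)])        # plays the role of S_1
            traced[uy, ux] = True
            x_min = x_max = ux
            y_min = y_max = uy
            while pending:
                vx, vy = pending.popleft()     # v moves from S_1 to S
                x_min, x_max = min(x_min, vx), max(x_max, vx)
                y_min, y_max = min(y_min, vy), max(y_max, vy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = vx + dx, vy + dy
                        if not (dx or dy):
                            continue
                        if not (0 <= nx < width and 0 <= ny < height):
                            continue
                        # Stay within the window of half-size w around seed u.
                        if abs(nx - ux) > w or abs(ny - uy) > w:
                            continue
                        if f[ny, nx] == 2 and not traced[ny, nx]:
                            traced[ny, nx] = True
                            pending.append((nx, ny))
            # Mark the whole block as traced, as described in the text.
            traced[y_min:y_max + 1, x_min:x_max + 1] = True
            blocks.append((x_min, y_min, x_max, y_max))
    return blocks
```

Replacing the single test `f[ny, nx] == 2` with the direction-dependent conditions of step 3 would bring the sketch closer to the described algorithm.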

Character images 108 are the blocks representing the lines and characters extracted by
107 from 106. Process 109 clusters the character images hierarchically. We will
elaborate the process 109 in 401 to 413. Process 109 outputs data 110, comprising the
character template library and the codes of all the character blocks output from
109. The code of a character block includes the absolute coordinates of the block in the
original image and the index of the template it uses. Character encoder 111 encodes
data 110.




The output 112 is the compressed bit stream for the characters. While the data 112 is
passed to the process 118, it is also decoded by the decoder 113, which is the
counterpart of SAQ encoder 111. The reconstructed character images 114 are used to
get the background image 115. Process 116 is a wavelet-based SAQ encoder
for grey scale images ([5], [6]). The compressed bit stream 117 for the background image
is passed to the process 118. The process 118 organizes the compressed bit streams of the
picture image blocks, character image blocks and the background image to generate
data 119.


Data 119 is organized as follows. We save the document image header, the
character codes of the character blocks and the location information of the picture image
blocks first; then the compressed bit planes corresponding to values greater than $2^7$ of
the character template library, picture image blocks and background image are stored; finally,
the residue bit plane information is added, one bit plane after another, from
the most significant bit plane to the least significant bit plane. Such an organization
guarantees progressive decoding of the document image. In other words, we can
obtain the document image from a blurred version to the finest one.
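The effect of storing bit planes from most significant to least significant can be shown with a small sketch. It illustrates only the ordering idea on plain 8-bit values, not the SAQ wavelet coder or the actual layout of data 119; both function names are hypothetical.

```python
import numpy as np

def bit_planes_msb_first(values, bits=8):
    """Split unsigned values into binary planes, most significant first."""
    v = np.asarray(values, dtype=np.uint8)
    return [(v >> b) & 1 for b in range(bits - 1, -1, -1)]

def progressive_reconstructions(planes, bits=8):
    """Rebuild the values after each additional plane is received."""
    acc = np.zeros(planes[0].shape, dtype=np.uint16)
    steps = []
    for k, plane in enumerate(planes):
        # Plane k carries the bit of weight 2**(bits - 1 - k).
        acc = acc | (plane.astype(np.uint16) << (bits - 1 - k))
        steps.append(acc.copy())
    return steps
```

Each additional plane halves the maximum reconstruction error, which is why a decoder that stops early still obtains a coarse but usable image.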



Picture image block extractor 102 

Picture image block extractor 102 is elaborated in the following. 

Process 301 estimates the peak value $P_0$ of the histogram of the document image and sets the
threshold $P = (128 + P_0)/2$. Pixels with intensity less than $P$ are classified as
foreground pixels; the other pixels are background pixels.

Process 302 partitions the entire document image into blocks of size $W \times W$, where
$W = 2^{\lfloor \log_2 (r/4) \rfloor}$ and $r$ is the scanned resolution.
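Processes 301 and 302 can be sketched as follows. The block-size expression is read here as W = 2^floor(log2(r/4)), which is an interpretation of a poorly legible formula, and the threshold is taken as P = (128 + P0)/2; the function names are hypothetical.

```python
import math
import numpy as np

def foreground_threshold(image):
    """P = (128 + P0) / 2, with P0 the peak of the intensity histogram."""
    pixels = np.asarray(image, dtype=np.uint8).ravel()
    hist = np.bincount(pixels, minlength=256)
    p0 = int(np.argmax(hist))          # most frequent intensity
    return (128 + p0) / 2

def block_size(r):
    """Block side W for a scan resolution of r dpi (assumed formula)."""
    return 2 ** math.floor(math.log2(r / 4))
```

Under this reading, a 300 dpi scan is partitioned into 64 x 64 blocks and a 72 dpi scan into 16 x 16 blocks.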

Process 303 classifies the blocks into two types: picture blocks, marked 1, and nonpicture
blocks, marked 0. The verdict is based on the statistical features of the wavelet
decomposition of the blocks. The procedure is as follows.

A wavelet filter is used to decompose the block once, as in the conventional wavelet
decomposition of an image. For computational efficiency, the sum of the filter coefficients
is 2, and the suggested filter for this procedure is the Haar wavelet. The figure below
shows this procedure. LL, LH, HL and HH denote the lowest-frequency
to the highest-frequency components, as usual ([7]).



[Figure: one-level wavelet decomposition of a block into the subbands LL, LH, HL and HH]
In general, a document image is typically composed of a large portion of character
and edge regions, together with a rather small portion of homogeneous regions.
Homogeneous regions have the least variation. Character regions have moderate
variation, and lines show the most variation.
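Define

$$g(c) = \begin{cases} 1 & \text{when } |c| > \lambda \\ 0 & \text{otherwise} \end{cases}$$

where $\lambda$ is a predefined threshold (default $\lambda = 16$) and $c$ is a
wavelet coefficient.

Calculate the sums of the wavelet coefficients. The statistical variables used in
classification are the following:

$$\mathrm{count}_H = \frac{\sum_{c_{ij} \in H} g(c_{ij})}{1.5\,W}, \quad \text{where } H = HL \cup LH \cup HH,$$

$$\mathrm{average}_{LL} = \frac{\sum_{c_{ij} \in LL} c_{ij}}{S_{LL}}, \quad \text{where } S_{LL} \text{ is the total number of wavelet coefficients of } LL.$$

If $\mathrm{count}_H < B$ and $\mathrm{average}_{LL} < (P + 128)/2$, where $B$ is the predetermined threshold
whose default value is 3, the block is marked as a picture block; otherwise it is marked
as a nonpicture block.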
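The block verdict can be sketched as below. Several points are assumptions rather than statements of the source: the Haar step is normalised by 4 so that LL stays on the intensity scale, the naming of LH versus HL follows one common convention, and the count is normalised by 1.5 W as read from a poorly legible formula. All function and parameter names are hypothetical.

```python
import numpy as np

def haar_subbands(block):
    """One-level 2-D Haar split of a block with even side lengths."""
    b = np.asarray(block, dtype=float)
    a, b2 = b[0::2, 0::2], b[0::2, 1::2]   # top-left, top-right of each 2x2
    c, d = b[1::2, 0::2], b[1::2, 1::2]    # bottom-left, bottom-right
    ll = (a + b2 + c + d) / 4              # local average
    lh = (a + b2 - c - d) / 4              # vertical change
    hl = (a - b2 + c - d) / 4              # horizontal change
    hh = (a - b2 - c + d) / 4              # diagonal change
    return ll, lh, hl, hh

def is_picture_block(block, p, lam=16, b_threshold=3):
    """True when the block is smooth enough and dark enough on average."""
    ll, lh, hl, hh = haar_subbands(block)
    w = np.asarray(block).shape[0]
    significant = sum(int((np.abs(s) > lam).sum()) for s in (lh, hl, hh))
    count_h = significant / (1.5 * w)      # assumed normalisation
    average_ll = float(ll.mean())
    return bool(count_h < b_threshold and average_ll < (p + 128) / 2)
```

A uniformly grey block passes both tests, a white block fails the average test, and a block in which every 2 x 2 cell holds a single dark-to-bright corner fails the count test.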
Switch 304 checks whether an untraced picture block exists. If the answer is NO, all
picture blocks have already been saved in data 316 and process 102 finishes.

Otherwise, the next untraced picture block is identified in step 305 and its
mark is changed to zero, and the picture area is initialised to the minimum rectangle
containing the current block in step 306.

The process 317 extracts the rectangular area of a picture image and consists of two
steps. Firstly, process 318 extracts the picture blocks. Then this area grows further
to its neighbour pixels in process 319 if necessary.

Switch 307 checks whether there is a neighbour block of the current picture area whose
mark is 1. If the answer is YES, this block is marked 0 in 308, the picture area is extended
to a new rectangular area containing this block in process 309, and we go back to switch 307.
If the answer is NO, none of the neighbour blocks is a picture block. We finish process 318 and
go to switch 310.

Switch 310 checks whether the rectangular picture area is big enough by comparing
its length and width to the preset value (default 2W). If the answer is NO, no
picture area has been found and we turn to switch 304. Otherwise, the answer is YES and we
store the location information of the picture area in 311.

Process 319, comprising the following steps, refines the picture area. Switch 312 checks
whether there is a fore-pixel among the neighbour pixels of the current area. If the answer is
YES, process 313 extends the picture area to the new rectangular picture area
containing the found fore-pixel and we go back to switch 312. If the answer is NO, all
neighbour pixels of the current picture area are back-pixels. Process 319 finishes and saves
this rectangular picture as a picture image area in process 314. Process 315 appends this
picture image area to the list of picture images, then we go back to switch 304.
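The block-level part of this flow (steps 304 to 311) can be sketched as a rectangle that keeps absorbing marked blocks touching it. The sketch works in block units, so the default minimum size 2W becomes two blocks; the pixel-level refinement of process 319 is left out. Function and parameter names are hypothetical.

```python
def grow_picture_areas(marks, min_blocks=2):
    """Return rectangles (row0, col0, row1, col1), inclusive, in block units."""
    marks = [row[:] for row in marks]              # work on a copy
    rows = len(marks)
    cols = len(marks[0]) if marks else 0
    areas = []
    for r in range(rows):
        for c in range(cols):
            if marks[r][c] != 1:                   # switch 304: next untraced block
                continue
            marks[r][c] = 0                        # step 305
            r0, c0, r1, c1 = r, c, r, c            # step 306
            grown = True
            while grown:                           # switch 307
                grown = False
                for rr in range(max(r0 - 1, 0), min(r1 + 2, rows)):
                    for cc in range(max(c0 - 1, 0), min(c1 + 2, cols)):
                        if marks[rr][cc] == 1:
                            marks[rr][cc] = 0      # step 308
                            r0, c0 = min(r0, rr), min(c0, cc)   # step 309
                            r1, c1 = max(r1, rr), max(c1, cc)
                            grown = True
            # Switch 310: keep only areas that are large enough.
            if r1 - r0 + 1 >= min_blocks and c1 - c0 + 1 >= min_blocks:
                areas.append((r0, c0, r1, c1))     # step 311
    return areas
```

An isolated marked block yields a 1 x 1 rectangle and is discarded by the size check, which matches the NO branch of switch 310.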

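Process 401 generates the style of characters.

Let $A = \{(i, j) \mid I(i, j) < P,\ i = 0, 1, \ldots, w-1;\ j = 0, 1, \ldots, h-1\}$, where $I(i, j)$ is the intensity of the
pixel at coordinates $(i, j)$, $w$ is the width and $h$ is the height. $P$ is estimated in step 301.

The block distance of two pixels is defined as $d((i_1, j_1), (i_2, j_2)) = |i_1 - i_2| + |j_1 - j_2|$.

$$d_1 = \min_{(i,j) \in A} d((0, 0), (i, j)), \quad d_2 = \min_{(i,j) \in A} d((0, h-1), (i, j)),$$
$$d_3 = \min_{(i,j) \in A} d((w-1, 0), (i, j)), \quad d_4 = \min_{(i,j) \in A} d((w-1, h-1), (i, j)).$$

The style of this character is $(w, h, d_1, d_2, d_3, d_4)$.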
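The style tuple can be sketched as follows, assuming that the four distances are the smallest block (city-block) distances from the foreground pixels to the four corners of the character block, taken in the order (0, 0), (0, h-1), (w-1, 0), (w-1, h-1). That corner order and the function name are assumptions.

```python
import numpy as np

def character_style(block, p):
    """Return (w, h, d1, d2, d3, d4) for a grey scale character block."""
    img = np.asarray(block)
    h, w = img.shape
    rows, cols = np.nonzero(img < p)       # foreground pixels: j = row, i = column
    if cols.size == 0:
        return (w, h, None, None, None, None)
    corners = [(0, 0), (0, h - 1), (w - 1, 0), (w - 1, h - 1)]
    dists = tuple(
        int(np.min(np.abs(cols - ci) + np.abs(rows - cj)))
        for ci, cj in corners
    )
    return (w, h) + dists
```

Comparing two such tuples is much cheaper than comparing pixels, which is why the style check precedes the morphological match in the flow described here.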

We define three sets $L_0$, $L_1$, $L_2$ for process 109. $L_0$ is the collection of character
image blocks, $L_1$ is the collection of the character code information of the character
image blocks, and $L_2$ is the library of character templates, used to save the images of the
character templates. Switch 402 checks whether $L_0$ is empty. If the answer is YES,
all character blocks have been processed; data 403, comprising $L_1$ and $L_2$,
is then output and process 109 terminates. Otherwise, the answer is NO in step
402, and we get the next character block $T$ in $L_0$ in process 404. 414 is the process of
matching character block $T$ against the templates in $L_2$. We start from the head of $L_2$
and check whether all templates in $L_2$ have been used in 406.

If the answer is YES, $T$ is a new type of character. In step 407 we append it
to $L_2$ as a new character template $TL$, save the code information of $T$ against $TL$ to
$L_1$, and remove $T$ from $L_0$, then go back to switch 402.

Otherwise, if the answer is NO in 406, we get the character template $TL$ from $L_2$ in step
408. We match $T$ against $TL$ in two steps. First, $T$ is matched against $TL$ by
their style in process 409. Switch 410 checks the result of process 409; if the answer is NO, go to
406. If the answer is YES, then $T$ is matched against $TL$ by the morphological character
matching method in 411.

411 uses a morphological approach, with which the matching of two characters is fast
and accurate compared to conventional matching methods such as matching by
grey scale similarity. The new measurement based on morphology is better than the
Euclidean distance measurement and the Hausdorff measurement in noisy
environments, due to the stability of the measurement.

The new morphological operator measures the size of the difference image of two
images (one is the template and the other is the character block). Assume the two images
are $f$ and $g$; the difference image $f - g$ is defined as follows:

$$(f - g)(x) = \begin{cases} \ldots \\ 0, & \text{otherwise} \end{cases}$$

with threshold $C_M = 32$.

The difference image $f - g$ is a binary image; in other words, it is a binary set.
Define the size of a set $A$ with respect to structure element $B$ as

$$e(A)_B = \sup_{a} \{a : A \circ aB \neq \emptyset\}, \quad a \in \mathbb{R},$$

where $A \circ B$ is the normal morphological opening operator.

The new measurement of the difference between two binary sets can be defined as
$S_B(f, g) = e(f - g)_B$, where $B$ is a square structure element of size 1.
The similarity measure of two sets $f$, $g$ is $M(f, g) = \max\{S_B(f, g), S_B(g, f)\}$. The
new measurement is symmetric with respect to whether the distortion is concave or
convex; the Hausdorff measurement, however, is not symmetric [8].

If the measure is less than the average size of the noise region, the matching
succeeds. We develop a fast algorithm based on this theory for the character matching
problem. The measure of the difference is modified as $M(f, g) = S_B(f, g)$. For the
matching of character images with resolution no less than 72 dpi, if the measure is less
than 2, the matching of the character against the template succeeds. The algorithm is as
follows.

Algorithm M1

1. Suppose $(f - g)(x)$ is a sequence with length $m$. $x \leftarrow 0$.

2. If $(f - g)(x) = 0$, go to step 5.

3. If $(f - g)(x + 1) = 0$, $(f - g)(x) \leftarrow 0$, go to step 5.

4. $x \leftarrow x + 1$.

5. If $x < m - 1$, $x \leftarrow x + 1$, go to step 2.

6. End.
Algorithm M2

1. Suppose $(f - g)(x)$ is a sequence with length $m$. $x \leftarrow 0$.

2. If $(f - g)(x) = 1$ and $(f - g)(x + 1) = 1$, go to step 5.

3. If $x < m - 1$, $x \leftarrow x + 1$, go to step 2.

4. The character matches the template; go to step 6.

5. The character does not match the template; go to step 6.

6. End.
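The two step lists translate almost directly into code. The sketch below applies them to a single row (or column) of the binary difference image given as a list of 0/1 values. One assumption is added: a position beyond the end of the sequence is treated as 0, since the steps read (f - g)(x + 1) at the last index. The function names are hypothetical.

```python
def algorithm_m1(diff):
    """Clear difference pixels that are not followed by a second pixel."""
    d = list(diff)
    m = len(d)
    x = 0
    while x < m:
        if d[x] == 1:                               # step 2 falls through
            nxt = d[x + 1] if x + 1 < m else 0      # assumed 0 past the end
            if nxt == 0:
                d[x] = 0                            # step 3
            else:
                x += 1                              # step 4: skip the pair
        if x < m - 1:                               # step 5
            x += 1
        else:
            break                                   # step 6
    return d

def algorithm_m2(diff):
    """Return True (match) when no two adjacent difference pixels remain."""
    for x in range(len(diff) - 1):
        if diff[x] == 1 and diff[x + 1] == 1:       # step 2 leads to step 5
            return False
    return True                                     # step 4
```

In this reading, M1 removes isolated difference pixels and the odd tail of longer runs, and M2 then rejects the match as soon as a run of two or more survives, which corresponds to the "measure less than 2" criterion for a line structure element.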

Whether the condition is weak or strong depends on the structure element used in algorithm
M1 and the associated part of algorithm M2. Here, a strong condition means that it
is difficult to match a character against a template; conversely, a weak condition
means that it is very easy to match a character against a template. A strong condition
will decrease the compression ratio slightly, but a weak condition will generate false
matches, and the reconstructed character may not be correct when the scanned
document image quality is very poor. The order line, circle, square corresponds
to the conditions from strong to weak.

Note: 

1. If algorithm M1 is performed only along the row direction, the structure element used
in the matching algorithm is a horizontal line. This element is good
enough for English character matching.

2. If algorithm M1 is performed only along the column direction, the structure element used
in the matching algorithm is a vertical line.

3. If algorithm M1 is performed along both the row and column directions, the structure
element used in the matching algorithm is a circle. The circle structure element
works well for characters of most languages.

4. If algorithm M1 is performed along the row direction followed by the column direction,
and then along the column direction followed by the row direction, the
structure element used in the matching algorithm is a square.

5. For the line structure elements, we only need to apply algorithm M2 along the
same direction as M1. For the circle structure element, algorithm M2
is performed in either the horizontal or the vertical direction. For the square structure
element, algorithm M2 is performed in both the horizontal and vertical
directions before we can conclude that the match succeeds.

Switch 412 checks whether $T$ matches $TL$. If the answer is NO, go back to
switch 406. Otherwise, the answer is YES: the information of $T$ is appended to $L_1$, the
code of $T$ being the index of pattern $TL$ in $L_2$; process 413 then removes $T$ from $L_0$, and we
go to switch 402.
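The control flow of steps 402 to 413 can be sketched as a single loop over the character blocks. The two predicates stand in for the style comparison (409) and the morphological comparison (411) and are supplied by the caller; the coordinates stored with each code in the described system are omitted. All names are hypothetical.

```python
def cluster_characters(blocks, style_match, morph_match):
    """Return (codes, templates): one template index per block, plus the library."""
    codes = []                       # code information of each block
    templates = []                   # template library
    for t in blocks:                 # 402/404: take the next block
        index = None
        for k, tl in enumerate(templates):          # 406/408: try each template
            # 409/410: cheap style test first, then 411/412: morphological test.
            if style_match(t, tl) and morph_match(t, tl):
                index = k
                break
        if index is None:            # 407: no template fits, start a new one
            templates.append(t)
            index = len(templates) - 1
        codes.append(index)          # 413: block is done
    return codes, templates
```

Because `and` short-circuits, the more expensive morphological test runs only for templates that already pass the style test, mirroring the two-stage order of the flow chart.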
Decoder 20

Figure 2 is the WDIC decoder 20, which is the reverse process of encoder 10. Decoder 20
starts from the compressed bit stream 201 of the document image. Process 202 separates 201
into three parts based on the format of the compressed document image described in 118.
These three parts are the compressed bit stream 203 of the background image, the compressed bit
stream 206 of the character image blocks and the compressed bit stream 209 of the picture image
blocks.

Data 203 is decoded by the wavelet-based SAQ decoder 204 to generate the background
image. Data 206 is decoded by the character decoder 207 to generate the information of the
character codes of the character image blocks and the character template library. Data 209 is
decoded by the wavelet-based SAQ decoder 210 to generate the picture image blocks 211.
Data 205, 208 and 211 are combined into the document image 213 in process 212.

Claims 

What is claimed: 

1. The method of encoding a document image, comprising the steps of
extracting two main categories of areas, picture areas and character/line areas, from
the document image, and

obtaining the residue image by subtracting these two categories of areas from the
document image, and

classifying the characters/lines according to dynamically generated templates,
and

encoding the residue image by a wavelet-based SAQ method.

2. The method of extracting picture areas from a document image according to
claim 1.

3. The method according to claim 2, comprising marking the blocks partitioned
from the document image based on features of their wavelet coefficients,

4. The method according to claim 2, further comprising hierarchical picture area 
extracting method comprising steps of 

extracting the picture blocks first to generate the initial picture area and, 
refining the picture area to cover the neighbour picture pixels of original area. 

5. The method according to claim 1, further comprising the method of extracting 
character/line areas from document image, 

6. The method according to claim 5 comprising of special definition of the 
connectivity, 

7. The method according to claim 5, further comprising the method of
extracting the character/line areas,

8. The method according to claim 1, further comprising the method of classifying
characters/lines,

9. The method according to claim 8, comprising generating the style of the
character/line areas,

10. The method according to claim 8, further comprising the hierarchical matching
of the character/line area against character/line templates, comprising the steps of
matching the styles of character/line areas against the styles of templates first and,
matching the character/line areas against the templates,

11. The method of matching of the character/line areas against template 
comprising of morphological matching 
12. The method according to claim 11, comprising specific character/line area
matching algorithms M1 and M2,

13. The method according to claim 11, further comprising the method of using
different structure elements for different kinds of document image,

14. The method according to claim 1, further comprising of bit plane storage of 
the compressed stream of the document image by the order of character/line, 
picture and background image which can be progressive decoding by 
associated decoder, 

15. The associated decoder of the encoder in claim 1.